Overview

Dataset statistics

Number of variables: 14
Number of observations: 891
Missing cells: 687
Missing cells (%): 5.5%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 381.0 KiB
Average record size in memory: 437.9 B
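
The two derived figures above can be re-computed from the raw counts. The following is only an arithmetic sketch; it assumes that the missing-cell percentage is taken over all cells (rows × columns) and that 1 KiB = 1024 bytes, neither of which is stated in the report itself.

```python
# Values copied from the dataset statistics above.
n_variables = 14
n_observations = 891
missing_cells = 687
total_size_kib = 381.0

# Assumption: the percentage is computed over every cell (rows x columns).
total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells

# Assumption: 1 KiB = 1024 bytes.
avg_record_bytes = total_size_kib * 1024 / n_observations

print(f"Missing cells (%): {missing_pct:.1f}%")
print(f"Average record size in memory: {avg_record_bytes:.1f} B")
```

Under these assumptions both values round to the figures reported above.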

Variable types

CAT: 7
NUM: 6
BOOL: 1

Reproduction

Analysis started: 2020-05-18 18:26:38.907414
Analysis finished: 2020-05-18 18:26:47.434904
Duration: 8.53 seconds
Version: pandas-profiling v2.7.1
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml
Cabin has a high cardinality: 147 distinct values [High cardinality]
Name has a high cardinality: 891 distinct values [High cardinality]
Ticket has a high cardinality: 681 distinct values [High cardinality]
Title is highly correlated with Sex [High correlation]
Sex is highly correlated with Title [High correlation]
Cabin has 687 (77.1%) missing values [Missing]
Cabin is uniformly distributed [Uniform]
Name is uniformly distributed [Uniform]
PassengerId is uniformly distributed [Uniform]
Ticket is uniformly distributed [Uniform]
Name has unique values [Unique]
PassengerId has unique values [Unique]
Fare has 15 (1.7%) zeros [Zeros]
Parch has 678 (76.1%) zeros [Zeros]
SibSp has 608 (68.2%) zeros [Zeros]
Family_Size has 537 (60.3%) zeros [Zeros]

Variables

Age
Real number (ℝ≥0)

Distinct count: 89
Unique (%): 10.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 29.44519640852974
Minimum: 0.42
Maximum: 80.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.1 KiB

Quantile statistics

Minimum: 0.42
5-th percentile: 5
Q1: 22
Median: 30
Q3: 35.5
95-th percentile: 54
Maximum: 80
Range: 79.58
Interquartile range (IQR): 13.5

Descriptive statistics

Standard deviation: 13.24489592
Coefficient of variation (CV): 0.4498151663
Kurtosis: 0.7882478756
Mean: 29.44519641
Median Absolute Deviation (MAD): 8
Skewness: 0.4328988034
Sum: 26235.67
Variance: 175.4272679
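
Several of the Age statistics above are derived from others in the same lists. The sketch below re-computes them from the reported values, using the usual textbook definitions (range = maximum − minimum, IQR = Q3 − Q1, CV = standard deviation / mean, variance = standard deviation squared); it is a consistency check, not the report's own code.

```python
# Values copied from the Age statistics above.
minimum, maximum = 0.42, 80.0
q1, q3 = 22.0, 35.5
mean = 29.44519641
std = 13.24489592

value_range = maximum - minimum  # Range
iqr = q3 - q1                    # Interquartile range (IQR)
cv = std / mean                  # Coefficient of variation (CV)
variance = std ** 2              # Variance

print(f"Range: {value_range:.2f}")
print(f"IQR: {iqr}")
print(f"CV: {cv:.6f}")
print(f"Variance: {variance:.4f}")
```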
[Figure: histogram with fixed size bins (bins=10)]

Value | Count | Frequency (%)
30 | 144 | 16.2%
22 | 63 | 7.1%
24 | 30 | 3.4%
18 | 26 | 2.9%
28 | 25 | 2.8%
19 | 25 | 2.8%
21 | 24 | 2.7%
25 | 23 | 2.6%
36 | 22 | 2.5%
29 | 20 | 2.2%
Other values (79) | 489 | 54.9%

Value | Count | Frequency (%)
0.42 | 1 | 0.1%
0.67 | 1 | 0.1%
0.75 | 2 | 0.2%
0.83 | 2 | 0.2%
0.92 | 1 | 0.1%

Value | Count | Frequency (%)
80 | 1 | 0.1%
74 | 1 | 0.1%
71 | 2 | 0.2%
70.5 | 1 | 0.1%
70 | 2 | 0.2%

Cabin
Categorical

HIGH CARDINALITY
MISSING
UNIFORM
Distinct count: 147
Unique (%): 72.1%
Missing: 687
Missing (%): 77.1%
Memory size: 7.1 KiB

Value | Count | Frequency (%)
B96 B98 | 4 | 0.4%
G6 | 4 | 0.4%
C23 C25 C27 | 4 | 0.4%
C22 C26 | 3 | 0.3%
D | 3 | 0.3%
F2 | 3 | 0.3%
E101 | 3 | 0.3%
F33 | 3 | 0.3%
D20 | 2 | 0.2%
D36 | 2 | 0.2%
Other values (137) | 173 | 19.4%
(Missing) | 687 | 77.1%

Length

Max length: 15
Mean length: 3.134680135
Min length: 1

Value | Count | Frequency (%)
Decimal_Number | 10 | 47.6%
Uppercase_Letter | 8 | 38.1%
Lowercase_Letter | 2 | 9.5%
Space_Separator | 1 | 4.8%

Value | Count | Frequency (%)
Common | 11 | 52.4%
Latin | 10 | 47.6%

Value | Count | Frequency (%)
ASCII | 21 | 100.0%

Embarked
Categorical

Distinct count: 3
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 7.1 KiB

Value | Count | Frequency (%)
S | 645 | 72.4%
C | 169 | 19.0%
Q | 77 | 8.6%

Length

Max length: 1
Mean length: 1
Min length: 1

Value | Count | Frequency (%)
Uppercase_Letter | 3 | 100.0%

Value | Count | Frequency (%)
Latin | 3 | 100.0%

Value | Count | Frequency (%)
ASCII | 3 | 100.0%

Fare
Real number (ℝ≥0)

ZEROS
Distinct count: 248
Unique (%): 27.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 32.204207968574636
Minimum: 0.0
Maximum: 512.3292
Zeros: 15
Zeros (%): 1.7%
Memory size: 7.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 7.225
Q1: 7.9104
Median: 14.4542
Q3: 31
95-th percentile: 112.07915
Maximum: 512.3292
Range: 512.3292
Interquartile range (IQR): 23.0896

Descriptive statistics

Standard deviation: 49.6934286
Coefficient of variation (CV): 1.543072528
Kurtosis: 33.39814088
Mean: 32.20420797
Median Absolute Deviation (MAD): 6.9042
Skewness: 4.78731652
Sum: 28693.9493
Variance: 2469.436846
[Figure: histogram with fixed size bins (bins=10)]

Value | Count | Frequency (%)
8.05 | 43 | 4.8%
13 | 42 | 4.7%
7.8958 | 38 | 4.3%
7.75 | 34 | 3.8%
26 | 31 | 3.5%
10.5 | 24 | 2.7%
7.925 | 18 | 2.0%
7.775 | 16 | 1.8%
26.55 | 15 | 1.7%
0 | 15 | 1.7%
Other values (238) | 615 | 69.0%

Value | Count | Frequency (%)
0 | 15 | 1.7%
4.0125 | 1 | 0.1%
5 | 1 | 0.1%
6.2375 | 1 | 0.1%
6.4375 | 1 | 0.1%

Value | Count | Frequency (%)
512.3292 | 3 | 0.3%
263 | 4 | 0.4%
262.375 | 2 | 0.2%
247.5208 | 2 | 0.2%
227.525 | 4 | 0.4%

Name
Categorical

HIGH CARDINALITY
UNIFORM
UNIQUE
Distinct count: 891
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Memory size: 7.1 KiB

Value | Count | Frequency (%)
de Messemaeker, Mrs. Guillaume Joseph (Emma) | 1 | 0.1%
Oreskovic, Mr. Luka | 1 | 0.1%
Dorking, Mr. Edward Arthur | 1 | 0.1%
Balkic, Mr. Cerin | 1 | 0.1%
Hocking, Mr. Richard George | 1 | 0.1%
Ostby, Mr. Engelhart Cornelius | 1 | 0.1%
Foreman, Mr. Benjamin Laventall | 1 | 0.1%
Brown, Miss. Amelia "Mildred" | 1 | 0.1%
Ivanoff, Mr. Kanio | 1 | 0.1%
Barbara, Mrs. (Catherine David) | 1 | 0.1%
Other values (881) | 881 | 98.9%

Length

Max length: 82
Mean length: 26.96520763
Min length: 12

Value | Count | Frequency (%)
Lowercase_Letter | 26 | 43.3%
Uppercase_Letter | 25 | 41.7%
Other_Punctuation | 5 | 8.3%
Space_Separator | 1 | 1.7%
Dash_Punctuation | 1 | 1.7%
Open_Punctuation | 1 | 1.7%
Close_Punctuation | 1 | 1.7%

Value | Count | Frequency (%)
Latin | 51 | 85.0%
Common | 9 | 15.0%

Value | Count | Frequency (%)
ASCII | 60 | 100.0%

Parch
Real number (ℝ≥0)

ZEROS
Distinct count: 7
Unique (%): 0.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.38159371492704824
Minimum: 0
Maximum: 6
Zeros: 678
Zeros (%): 76.1%
Memory size: 7.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 0
Q3: 0
95-th percentile: 2
Maximum: 6
Range: 6
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 0.8060572211
Coefficient of variation (CV): 2.112344071
Kurtosis: 9.778125179
Mean: 0.3815937149
Median Absolute Deviation (MAD): 0
Skewness: 2.749117047
Sum: 340
Variance: 0.6497282437
[Figure: histogram with fixed size bins (bins=10)]

Value | Count | Frequency (%)
0 | 678 | 76.1%
1 | 118 | 13.2%
2 | 80 | 9.0%
5 | 5 | 0.6%
3 | 5 | 0.6%
4 | 4 | 0.4%
6 | 1 | 0.1%

Value | Count | Frequency (%)
0 | 678 | 76.1%
1 | 118 | 13.2%
2 | 80 | 9.0%
3 | 5 | 0.6%
4 | 4 | 0.4%

Value | Count | Frequency (%)
6 | 1 | 0.1%
5 | 5 | 0.6%
4 | 4 | 0.4%
3 | 5 | 0.6%
2 | 80 | 9.0%

PassengerId
Real number (ℝ≥0)

UNIFORM
UNIQUE
Distinct count: 891
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 446.0
Minimum: 1
Maximum: 891
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.1 KiB

Quantile statistics

Minimum: 1
5-th percentile: 45.5
Q1: 223.5
Median: 446
Q3: 668.5
95-th percentile: 846.5
Maximum: 891
Range: 890
Interquartile range (IQR): 445

Descriptive statistics

Standard deviation: 257.353842
Coefficient of variation (CV): 0.5770265516
Kurtosis: -1.2
Mean: 446
Median Absolute Deviation (MAD): 223
Skewness: 0
Sum: 397386
Variance: 66231
[Figure: histogram with fixed size bins (bins=10)]

Value | Count | Frequency (%)
891 | 1 | 0.1%
293 | 1 | 0.1%
304 | 1 | 0.1%
303 | 1 | 0.1%
302 | 1 | 0.1%
301 | 1 | 0.1%
300 | 1 | 0.1%
299 | 1 | 0.1%
298 | 1 | 0.1%
297 | 1 | 0.1%
Other values (881) | 881 | 98.9%

Value | Count | Frequency (%)
1 | 1 | 0.1%
2 | 1 | 0.1%
3 | 1 | 0.1%
4 | 1 | 0.1%
5 | 1 | 0.1%

Value | Count | Frequency (%)
891 | 1 | 0.1%
890 | 1 | 0.1%
889 | 1 | 0.1%
888 | 1 | 0.1%
887 | 1 | 0.1%

Pclass
Categorical

Distinct count: 3
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 7.1 KiB

Value | Count | Frequency (%)
3 | 491 | 55.1%
1 | 216 | 24.2%
2 | 184 | 20.7%

Length

Max length: 1
Mean length: 1
Min length: 1

Value | Count | Frequency (%)
Decimal_Number | 3 | 100.0%

Value | Count | Frequency (%)
Common | 3 | 100.0%

Value | Count | Frequency (%)
ASCII | 3 | 100.0%

Sex
Categorical

HIGH CORRELATION
Distinct count: 2
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 7.1 KiB

Value | Count | Frequency (%)
male | 577 | 64.8%
female | 314 | 35.2%

Length

Max length: 6
Mean length: 4.704826038
Min length: 4

Value | Count | Frequency (%)
Lowercase_Letter | 5 | 100.0%

Value | Count | Frequency (%)
Latin | 5 | 100.0%

Value | Count | Frequency (%)
ASCII | 5 | 100.0%

SibSp
Real number (ℝ≥0)

ZEROS
Distinct count: 7
Unique (%): 0.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.5230078563411896
Minimum: 0
Maximum: 8
Zeros: 608
Zeros (%): 68.2%
Memory size: 7.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 0
Q3: 1
95-th percentile: 3
Maximum: 8
Range: 8
Interquartile range (IQR): 1

Descriptive statistics

Standard deviation: 1.102743432
Coefficient of variation (CV): 2.108464374
Kurtosis: 17.88041973
Mean: 0.5230078563
Median Absolute Deviation (MAD): 0
Skewness: 3.695351727
Sum: 466
Variance: 1.216043077
[Figure: histogram with fixed size bins (bins=10)]

Value | Count | Frequency (%)
0 | 608 | 68.2%
1 | 209 | 23.5%
2 | 28 | 3.1%
4 | 18 | 2.0%
3 | 16 | 1.8%
8 | 7 | 0.8%
5 | 5 | 0.6%

Value | Count | Frequency (%)
0 | 608 | 68.2%
1 | 209 | 23.5%
2 | 28 | 3.1%
3 | 16 | 1.8%
4 | 18 | 2.0%

Value | Count | Frequency (%)
8 | 7 | 0.8%
5 | 5 | 0.6%
4 | 18 | 2.0%
3 | 16 | 1.8%
2 | 28 | 3.1%

Survived
Boolean

Distinct count: 2
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 7.1 KiB

Value | Count | Frequency (%)
0 | 549 | 61.6%
1 | 342 | 38.4%

Ticket
Categorical

HIGH CARDINALITY
UNIFORM
Distinct count: 681
Unique (%): 76.4%
Missing: 0
Missing (%): 0.0%
Memory size: 7.1 KiB

Value | Count | Frequency (%)
CA. 2343 | 7 | 0.8%
1601 | 7 | 0.8%
347082 | 7 | 0.8%
CA 2144 | 6 | 0.7%
3101295 | 6 | 0.7%
347088 | 6 | 0.7%
S.O.C. 14879 | 5 | 0.6%
382652 | 5 | 0.6%
17421 | 4 | 0.4%
LINE | 4 | 0.4%
Other values (671) | 834 | 93.6%

Length

Max length: 18
Mean length: 6.750841751
Min length: 3

Value | Count | Frequency (%)
Uppercase_Letter | 16 | 45.7%
Decimal_Number | 10 | 28.6%
Lowercase_Letter | 6 | 17.1%
Other_Punctuation | 2 | 5.7%
Space_Separator | 1 | 2.9%

Value | Count | Frequency (%)
Latin | 22 | 62.9%
Common | 13 | 37.1%

Value | Count | Frequency (%)
ASCII | 35 | 100.0%

Title
Categorical

HIGH CORRELATION
Distinct count: 6
Unique (%): 0.7%
Missing: 0
Missing (%): 0.0%
Memory size: 7.1 KiB

Value | Count | Frequency (%)
Mr | 525 | 58.9%
Miss | 185 | 20.8%
Mrs | 128 | 14.4%
Master | 40 | 4.5%
Dr | 7 | 0.8%
Rev | 6 | 0.7%

Length

Max length: 6
Mean length: 2.745230079
Min length: 2

Value | Count | Frequency (%)
Lowercase_Letter | 7 | 70.0%
Uppercase_Letter | 3 | 30.0%

Value | Count | Frequency (%)
Latin | 10 | 100.0%

Value | Count | Frequency (%)
ASCII | 10 | 100.0%

Family_Size
Real number (ℝ≥0)

ZEROS
Distinct count: 9
Unique (%): 1.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.9046015712682379
Minimum: 0
Maximum: 10
Zeros: 537
Zeros (%): 60.3%
Memory size: 7.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 0
Q3: 1
95-th percentile: 5
Maximum: 10
Range: 10
Interquartile range (IQR): 1

Descriptive statistics

Standard deviation: 1.613458541
Coefficient of variation (CV): 1.783612358
Kurtosis: 9.15966597
Mean: 0.9046015713
Median Absolute Deviation (MAD): 0
Skewness: 2.727441474
Sum: 806
Variance: 2.603248465
[Figure: histogram with fixed size bins (bins=10)]

Value | Count | Frequency (%)
0 | 537 | 60.3%
1 | 161 | 18.1%
2 | 102 | 11.4%
3 | 29 | 3.3%
5 | 22 | 2.5%
4 | 15 | 1.7%
6 | 12 | 1.3%
10 | 7 | 0.8%
7 | 6 | 0.7%

Value | Count | Frequency (%)
0 | 537 | 60.3%
1 | 161 | 18.1%
2 | 102 | 11.4%
3 | 29 | 3.3%
4 | 15 | 1.7%

Value | Count | Frequency (%)
10 | 7 | 0.8%
7 | 6 | 0.7%
6 | 12 | 1.3%
5 | 22 | 2.5%
4 | 15 | 1.7%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and 1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
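
The calculation just described can be sketched in plain Python. The function name `pearson_r` and the toy data are illustrative assumptions, not part of the report; the 1/n factors of the covariance and the standard deviations cancel, so sums of products are used directly.

```python
import math

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of (co)deviations; the common 1/n factor cancels in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: y is an increasing linear function of x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(x, y))
# Shifting and rescaling y (with a positive factor) leaves r unchanged.
print(pearson_r(x, [3.0 * v + 10.0 for v in y]))
```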

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and 1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
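
As a sketch of this rank-based calculation, the code below converts each variable to ranks and then applies the covariance-over-standard-deviations ratio to the ranks. The helper names and the choice to give tied values their average rank are assumptions made for illustration.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson-style ratio computed on the rank variables of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: a nonlinear but strictly increasing relationship.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(spearman_rho(x, y))
```

Because only the ordering matters, the cubic relationship in the toy data is treated as a perfect monotonic association.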

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and 1 indicating total positive correlation.

To calculate τ for two variables X and Y, one counts the concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
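
The pair-counting definition above can be sketched directly. This is a plain O(n²) illustration of the formula as stated; tied pairs are counted as neither concordant nor discordant, which is an assumption, and tie-adjusted variants used by statistical libraries can give different values when ties are present. The function name `kendall_tau` and the toy data are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        # Positive product: both variables order the pair the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means a tie in x or y; counted as neither (assumption).
    return (concordant - discordant) / len(pairs)

# Hypothetical toy data: one swapped pair out of six.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
print(kendall_tau(x, y))
```

In the toy data five of the six pairs are concordant and one is discordant, so τ = (5 − 1) / 6.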

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
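
A minimal sketch of a bias-corrected Cramér's V in the form usually attributed to Bergsma (2013) is shown below. It is not taken from the report's source code: the function name, the use of an uncorrected chi-squared statistic, and the choice to return 0.0 for degenerate inputs are all assumptions.

```python
import math
from collections import Counter

def cramers_v_corrected(x, y):
    """Bias-corrected Cramér's V for two equally long sequences of labels."""
    n = len(x)
    rows, cols = Counter(x), Counter(y)
    joint = Counter(zip(x, y))
    r, k = len(rows), len(cols)
    if n < 2 or r < 2 or k < 2:
        return 0.0  # assumption: treat degenerate inputs as "no association"
    # Chi-squared statistic of the contingency table (no continuity correction).
    chi2 = 0.0
    for a, row_total in rows.items():
        for b, col_total in cols.items():
            expected = row_total * col_total / n
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected
    phi2 = chi2 / n
    # Bias corrections for phi^2 and for the effective table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    if denom <= 0:
        return 0.0
    return math.sqrt(phi2_corr / denom)

# Hypothetical toy data: y is fully determined by x, z is independent of x.
x = ["a"] * 50 + ["b"] * 50
y = ["u"] * 50 + ["v"] * 50
z = ["u", "v"] * 50
print(cramers_v_corrected(x, y))
print(cramers_v_corrected(x, z))
```

The first call illustrates perfect association and the second independence, matching the two ends of the 0 to 1 range described above.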

Missing values

Sample

First rows

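Index | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket | Title | Family_Size
0 | 22.0 | NaN | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 | Mr | 1
1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 | Mrs | 1
2 | 26.0 | NaN | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282 | Miss | 0
3 | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | female | 1 | 1.0 | 113803 | Mrs | 1
4 | 35.0 | NaN | S | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 3 | male | 0 | 0.0 | 373450 | Mr | 0
5 | 30.0 | NaN | Q | 8.4583 | Moran, Mr. James | 0 | 6 | 3 | male | 0 | 0.0 | 330877 | Mr | 0
6 | 54.0 | E46 | S | 51.8625 | McCarthy, Mr. Timothy J | 0 | 7 | 1 | male | 0 | 0.0 | 17463 | Mr | 0
7 | 2.0 | NaN | S | 21.0750 | Palsson, Master. Gosta Leonard | 1 | 8 | 3 | male | 3 | 0.0 | 349909 | Master | 4
8 | 27.0 | NaN | S | 11.1333 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 2 | 9 | 3 | female | 0 | 1.0 | 347742 | Mrs | 2
9 | 14.0 | NaN | C | 30.0708 | Nasser, Mrs. Nicholas (Adele Achem) | 0 | 10 | 2 | female | 1 | 1.0 | 237736 | Mrs | 1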

Last rows

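Index | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket | Title | Family_Size
881 | 33.0 | NaN | S | 7.8958 | Markun, Mr. Johann | 0 | 882 | 3 | male | 0 | 0.0 | 349257 | Mr | 0
882 | 22.0 | NaN | S | 10.5167 | Dahlberg, Miss. Gerda Ulrika | 0 | 883 | 3 | female | 0 | 0.0 | 7552 | Miss | 0
883 | 28.0 | NaN | S | 10.5000 | Banfield, Mr. Frederick James | 0 | 884 | 2 | male | 0 | 0.0 | C.A./SOTON 34068 | Mr | 0
884 | 25.0 | NaN | S | 7.0500 | Sutehall, Mr. Henry Jr | 0 | 885 | 3 | male | 0 | 0.0 | SOTON/OQ 392076 | Mr | 0
885 | 39.0 | NaN | Q | 29.1250 | Rice, Mrs. William (Margaret Norton) | 5 | 886 | 3 | female | 0 | 0.0 | 382652 | Mrs | 5
886 | 27.0 | NaN | S | 13.0000 | Montvila, Rev. Juozas | 0 | 887 | 2 | male | 0 | 0.0 | 211536 | Rev | 0
887 | 19.0 | B42 | S | 30.0000 | Graham, Miss. Margaret Edith | 0 | 888 | 1 | female | 0 | 1.0 | 112053 | Miss | 0
888 | 22.0 | NaN | S | 23.4500 | Johnston, Miss. Catherine Helen "Carrie" | 2 | 889 | 3 | female | 1 | 0.0 | W./C. 6607 | Miss | 3
889 | 26.0 | C148 | C | 30.0000 | Behr, Mr. Karl Howell | 0 | 890 | 1 | male | 0 | 1.0 | 111369 | Mr | 0
890 | 32.0 | NaN | Q | 7.7500 | Dooley, Mr. Patrick | 0 | 891 | 3 | male | 0 | 0.0 | 370376 | Mr | 0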